{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dYwg9btt1wJH"
   },
   "source": [
    "# BERT\n",
    "\n",
    "**Bidirectional Encoder Representations from Transformers.**\n",
    "\n",
    "_ | _\n",
    "- | -\n",
    "![alt](https://pytorch.org/assets/images/bert1.png) | ![alt](https://pytorch.org/assets/images/bert2.png)\n",
    "\n",
    "\n",
    "### **Overview**\n",
    "\n",
    "BERT was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin *et al.* The model is based on the Transformer architecture introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani *et al.* and has led to significant improvements in a wide range of natural language tasks.\n",
    "\n",
    "At the highest level, BERT maps from a block of text to a numeric vector which summarizes the relevant information in the text. \n",
    "\n",
    "What is remarkable is that numeric summary is sufficiently informative that, for example, the numeric summary of a paragraph followed by a reading comprehension question contains all the information necessary to satisfactorily answer the question.\n",
    "\n",
    "#### **Transfer Learning**\n",
    "\n",
    "BERT is a great example of a paradigm called *transfer learning*, which has proved very effective in recent years. In the first step, a network is trained on an unsupervised task using massive amounts of data. In the case of BERT, it was trained to predict missing words and to detect when pairs of sentences are presented in reversed order using all of Wikipedia. This was initially done by Google, using intense computational resources.\n",
    "\n",
    "Once this network has been trained, it is then used to perform many other supervised tasks using only limited data and computational resources: for example, sentiment classification in tweets or quesiton answering. The network is re-trained to perform these other tasks in such a way that only the final, output parts of the network are allowed to adjust by very much, so that most of the \"information'' originally learned the network is preserved. This process is called *fine tuning*. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TgWpXdSIl5KL"
   },
   "source": [
    "##Getting to know BERT\n",
    "\n",
    "BERT, and many of its variants, are made avialable to the public by the open source [Huggingface Transformers](https://huggingface.co/transformers/) project. This is an amazing resource, giving researchers and practitioners easy-to-use access to this technology. \n",
    "\n",
    "In order to use BERT for modeling, we simply need to download the pre-trained neural network and fine tune it on our dataset, which is illustrated below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yDVt-3-GaS7q"
   },
   "outputs": [],
   "source": [
    "!pip install transformers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "hGRPttHC5WJU"
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from transformers import TFBertModel, BertTokenizer\n",
    "\n",
    "# Formatting tools\n",
    "from pprint import pformat \n",
    "np.set_printoptions(threshold=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 164,
     "referenced_widgets": [
      "cbdc3748eaaf41fbad4d0d2797addfa8",
      "4e32ba2928844536a9dd72607a3e73db",
      "c8a2efe909d64a7c9e8b61cbc47dc5a3",
      "7f03583b8431482db209e04218fe648b",
      "b077dccc9bea40bbb9707819175e98fb",
      "64656ec9fa9c408c87283550e82a1744",
      "2bd8b5bdb32049c586c41e33865ac2d2",
      "d02573b49aea49b9ac8ecc5f407c3892",
      "28143879288444d992a7784577a4c439",
      "6125414b428b41eeb9a15ed34e344d82",
      "625b093522bb4b43a95c52fccd71670b",
      "f2e69783c4eb44d28987cc665be736d1",
      "0f3aa953da61498c95c64e1f5bbddea2",
      "bf8c090ff02f43efa95508bc988e6905",
      "a63ad4f4c32d4c15922ac9f843c09338",
      "609b9fd1793a4a1197feb03f69ea770d",
      "b8e28b2d74d14abf8f1ef9d87d0a7bdb",
      "0d6c87d1032b460da2ace839eedbdf83",
      "75f6261ac0b849fbbad1bde5dda09646",
      "d2e942187d004b91991189482fafc289",
      "0fa197df7ffb4a0bbdfd468d2eaf685b",
      "108aa61ec3444a649ab9a060d52b650c",
      "f9f47a47c1b9423ab03330f4162522bd",
      "6709579748bf44d295b7e13c1a25d100"
     ]
    },
    "id": "8aNZcJwIcL5T",
    "outputId": "f63be5c5-c50f-47cc-e91e-0eb27fffa1fb"
   },
   "outputs": [],
   "source": [
    "# Download text pre-processor (\"tokenizer\")\n",
    "tokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 220,
     "referenced_widgets": [
      "d02ee926cf884442bd35e07d7e36a7c2",
      "6d524dfefa61472d9655124bef5871cf",
      "469e839fa7054be9baaf98dbb30f8d06",
      "e6b76e2df4704b79b86896995335a33b",
      "60bf08260d9e414b8897efb2256b259c",
      "f3d52698298e4ee095745e6609dfe7aa",
      "1aec568857104ab99266662964f43d76",
      "21abb705427546d082b28a38794bce6b",
      "b9a1e25fa7744bf6b7bbe7380a3ecb5a",
      "22a732d9244b46eb8fd92335cb82989f",
      "647431fbb8f140ed978800943922a673",
      "c0c8d2086c954b568817777bbb9719f2",
      "b0e757d5b80349f6a7db47b33d85c0ec",
      "b9e5538c9ba74377a0134c9a9dcc6346",
      "1dca0046fa614bf1b73eadcbaceb9db7",
      "4b4371d3b27b4ef68be822fe6765eef6"
     ]
    },
    "id": "K_ljmPI3cEQI",
    "outputId": "9ba35440-2d3f-49de-fa7e-cb43219f6044"
   },
   "outputs": [],
   "source": [
    "# Download BERT model\n",
    "bert = TFBertModel.from_pretrained(\"bert-base-uncased\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "26mRwUFwardQ"
   },
   "source": [
    "### Tokenization\n",
    "\n",
    "The first step in using BERT (or any similar text embedding tool) is to *tokenize* the data. This step standardizes blocks of text, so that meaningless differences in text presentation don't affect the behavior of our algorithm. \n",
    "\n",
    "Typically the text is transformed into a sequence of 'tokens,' each of which corresponds to a numeric code. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "vnLlb5peTwqM",
    "outputId": "75dac2f5-a361-49ef-964a-3c4d2f718cd3"
   },
   "outputs": [],
   "source": [
    "# Let's try it out!\n",
    "s = \"What happens to this string?\"\n",
    "print('Original String: \\n\\\"{}\\\"\\n'.format(s))\n",
    "tensors = tokenizer(s)\n",
    "print('Numeric encoding: \\n' + pformat(tensors))\n",
    "\n",
    "# What does this mean?\n",
    "print('\\nActual tokens:')\n",
    "tokenizer.convert_ids_to_tokens(tensors['input_ids'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JJaz6eEocefa"
   },
   "source": [
    "### BERT in a nutshell\n",
    "\n",
    "Once we have our numeric tokens, we can simply plug them into the BERT network and get a numeric vector summary. Note that in applications, the BERT summary will be \"fine tuned\" to a particular task, which hasn't happened yet. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Q1ODAgBMa3Zg",
    "outputId": "10b55eb6-5603-416b-d2b3-43a1a136bbb3"
   },
   "outputs": [],
   "source": [
    "print('Input: \"What happens to this string?\"\\n')\n",
    "\n",
    "# Tokenize the string\n",
    "tensors_tf = tokenizer(\"What happens to this string?\", return_tensors=\"tf\")\n",
    "\n",
    "# Run it through BERT\n",
    "output = bert(tensors_tf)\n",
    "\n",
    "# Inspect the output\n",
    "_shape = output['pooler_output'].shape\n",
    "print(\n",
    "\"\"\"Output type: {}\\n \n",
    "Output shape: {}\\n\n",
    "Output preview: {}\\n\"\"\"\n",
    ".format(\n",
    "type(output['pooler_output']),\n",
    " _shape, \n",
    "pformat(output['pooler_output'].numpy())))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y_CnEClsl_1p"
   },
   "source": [
    "# A practical introduction to BERT\n",
    "\n",
    "In the next part of the notebook, we are going to explore how a tool like BERT may be useful to an econometrician. \n",
    "\n",
    "In particular, we are going to apply BERT to a subset of data from the Amazon marketplace consisting of roughly 10,000 listings for products in the toy category. Each product comes with a text description, a price, and a number of times reviewed (which we'll use as a proxy for demand / market share). \n",
    "\n",
    "**Problem 1**:\n",
    "What are some issues you may anticipate when using number of reviews as a proxy for demand or market share?\n",
    "\n",
    "### Getting to know the data\n",
    "\n",
    "First, we'll download and clean up the data, and do some preliminary inspection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 204
    },
    "id": "1Su5vOGhD3Df",
    "outputId": "4fc4b503-3206-4f34-cd12-33f3d631843e"
   },
   "outputs": [],
   "source": [
    "# Download data\n",
    "DATA_URL = 'https://www.dropbox.com/s/on2nzeqcdgmt627/amazon_co-ecommerce_sample.csv?dl=1'\n",
    "data = pd.read_csv(DATA_URL)\n",
    "\n",
    "# Clean numeric data fields\n",
    "data['number_of_reviews'] = pd.to_numeric(data\n",
    "                              .number_of_reviews\n",
    "                              .str.replace(r\"\\D+\",''))\n",
    "data['price'] = (data\n",
    "                    .price\n",
    "                    .str.extract(r'(\\d+\\.*\\d+)')\n",
    "                    .astype('float'))\n",
    "\n",
    "# Drop products with very few reviews\n",
    "data = data[data['number_of_reviews'] > 0]\n",
    "\n",
    "# Compute log prices\n",
    "data['ln_p'] = np.log(data.price)\n",
    "\n",
    "# Impute market shares\n",
    "data['ln_q'] =  np.log(data['number_of_reviews'] / data['number_of_reviews'].sum())\n",
    "\n",
    "# Collect relevant text data\n",
    "data[['text']] = (data[[\n",
    "                    'product_name',\n",
    "                    'product_description']]\n",
    "                  .astype('str')\n",
    "                  .agg(' | '.join, axis=1))\n",
    " \n",
    "#  Drop irrelevant data and inspect\n",
    "data = data[['text','ln_p','ln_q']]\n",
    "data = data.dropna()\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CDlHPQZfcv7I"
   },
   "source": [
    "Let's make a two-way scatter plot of prices and (proxied) market shares. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 297
    },
    "id": "k98SQ_ySE68v",
    "outputId": "25da1162-705f-4d39-90db-0916c2cb9fe7"
   },
   "outputs": [],
   "source": [
    "# Plot log price against market share\n",
    "data.plot.scatter('ln_p','ln_q')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cr2Ha215l4QK"
   },
   "source": [
    "Let's begin with a simple prediction task. We will discover how well can we explain the price of these products using their textual descriptions.\n",
    "\n",
    "**Problem 2**:\n",
    " 1. Build a linear model that explains the price of each product using it's text embedding vector as the explanatory variables. \n",
    "\n",
    " 2. Build a two-layer perceptron neural network that explains the price of each product using the text embedding vector as input (see example code below).\n",
    "<!-- 3. Now, instead of taking the text embeddings as fixed, we allow the it to ``fine tune.'' Construct a neural network by combining the (pre-loaded) BERT network -->\n",
    "\n",
    " 3. Report the $R^2$ of both approaches. \n",
    "\n",
    " 4. As an econometrician, what are some concerns you may have about how to interpret these models?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "McYQ6yIPBfK6"
   },
   "outputs": [],
   "source": [
    "## First, let's split and preprocess (tokenize) the text to prepare it for BERT \n",
    "\n",
    "main = data.sample(frac=0.6,random_state=200)\n",
    "holdout = data.drop(main.index)\n",
    "\n",
    "tensors = tokenizer(\n",
    "    list(main[\"text\"]),\n",
    "    padding=True, \n",
    "    truncation=True, \n",
    "    max_length=128,\n",
    "    return_tensors=\"tf\")\n",
    "\n",
    "ln_p = main[\"ln_p\"]\n",
    "ln_q = main[\"ln_q\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Ck1xqRIrmx8I",
    "outputId": "09182b7c-3359-4589-d1bb-1bb11e2b31db"
   },
   "outputs": [],
   "source": [
    "## Now let's prepare our model\n",
    "\n",
    "from tensorflow.keras import Model, Input\n",
    "from tensorflow.keras.layers import Dense, Dropout, Concatenate\n",
    "\n",
    "input_ids = Input(shape=(128,), dtype=tf.int32)\n",
    "token_type_ids = Input(shape=(128,), dtype=tf.int32)\n",
    "attention_mask = Input(shape=(128,), dtype=tf.int32)\n",
    "\n",
    "# First we compute the text embedding\n",
    "Z = bert(input_ids, token_type_ids, attention_mask)\n",
    "\n",
    "# We want the \"pooled / summary\" embedding, not individual word embeddings\n",
    "Z = Z[1]\n",
    "\n",
    "# Then we do a regular regression\n",
    "Z = Dense(128, activation='relu')(Z)\n",
    "Z = Dropout(0.2)(Z)\n",
    "Z = Dense(32, activation='relu')(Z)\n",
    "Z = Dropout(0.2)(Z)\n",
    "Z = Dense(8, activation='relu')(Z)\n",
    "ln_p_hat = Dense(1, activation='linear')(Z)\n",
    "\n",
    "PricePredictionNetwork = Model([input_ids, token_type_ids, attention_mask], ln_p_hat)\n",
    "PricePredictionNetwork.compile(optimizer='adam', loss='mse')\n",
    "PricePredictionNetwork.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "rU9pYwQwz7Ib",
    "outputId": "91ac7c07-330f-43e4-de9c-a25fdbdec788"
   },
   "outputs": [],
   "source": [
    "PricePredictionNetwork.fit(\n",
    "                [tensors['input_ids'], tensors['token_type_ids'], tensors['attention_mask']], \n",
    "                ln_p,\n",
    "                epochs=3,\n",
    "                batch_size=16,\n",
    "                shuffle=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bafy9ftcoBed"
   },
   "source": [
    "Now, let's go one step further and construct a DML estimator of the average price elasticity. In particular, we will model market share $q_i$ as\n",
    "$$\\ln q_i = \\alpha + \\beta \\ln p_i + \\psi(d_i) + \\epsilon_i,$$ where $d_i$ denotes the description of product $i$ and $\\psi$ is the composition of text embedding and a two-layer perceptron. \n",
    "\n",
    "**Problem 3**: \n",
    " 1. Split the sample in two, and predict $\\ln p_i$ and $\\ln q_i$ using $d_i$ with a two-layer perceptron as before, using the main sample.\n",
    " 2. In the holdout sample, perform an OLS regression of the residual of $\\ln q_i$ on the residual of $\\ln p_i$ (using the previous problem's model). \n",
    " 3. What do you find? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Qiteu6FaoctV",
    "outputId": "2fa459bf-588b-4b80-ad20-2e173adf3d44"
   },
   "outputs": [],
   "source": [
    "## Build the quantity prediction network\n",
    "\n",
    "# Initialize new BERT model from original\n",
    "bert2 = TFBertModel.from_pretrained(\"bert-base-uncased\")\n",
    "\n",
    "# Define inputs\n",
    "input_ids = Input(shape=(128,), dtype=tf.int32)\n",
    "token_type_ids = Input(shape=(128,), dtype=tf.int32)\n",
    "attention_mask = Input(shape=(128,), dtype=tf.int32)\n",
    "\n",
    "# First we compute the text embedding\n",
    "Z = bert2(input_ids, token_type_ids, attention_mask)\n",
    "\n",
    "# We want the \"pooled / summary\" embedding, not individual word embeddings\n",
    "Z = Z[1]\n",
    "\n",
    "# Construct network\n",
    "Z = Dense(128, activation='relu')(Z)\n",
    "Z = Dropout(0.2)(Z)\n",
    "Z = Dense(32, activation='relu')(Z)\n",
    "Z = Dropout(0.2)(Z)\n",
    "Z = Dense(8, activation='relu')(Z)\n",
    "ln_q_hat = Dense(1, activation='linear')(Z)\n",
    "\n",
    "# Compile model and optimization routine\n",
    "QuantityPredictionNetwork = Model([input_ids, token_type_ids, attention_mask], ln_q_hat)\n",
    "QuantityPredictionNetwork.compile(optimizer='adam', loss='mse')\n",
    "QuantityPredictionNetwork.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "aaxHV0gGMqpw",
    "outputId": "53b03d31-82b3-4934-d53d-c375a70a6363"
   },
   "outputs": [],
   "source": [
    "## Fit the quantity prediction network in the main sample\n",
    "QuantityPredictionNetwork.fit(\n",
    "                [tensors['input_ids'], tensors['token_type_ids'], tensors['attention_mask']], \n",
    "                ln_q,\n",
    "                epochs=3,\n",
    "                batch_size=16,\n",
    "                shuffle=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "YADpNj0jMygZ",
    "outputId": "9f603632-ba59-474f-b328-ff01d829d511"
   },
   "outputs": [],
   "source": [
    "## Predict in the holdout sample, residualize and regress\n",
    "\n",
    "# Preprocess holdout sample\n",
    "tensors_holdout = tokenizer(\n",
    "    list(holdout[\"text\"]),\n",
    "    padding=True, \n",
    "    truncation=True, \n",
    "    max_length=128,\n",
    "    return_tensors=\"tf\")\n",
    "\n",
    "# Compute predictions\n",
    "ln_p_hat_holdout = PricePredictionNetwork.predict([tensors_holdout['input_ids'], tensors_holdout['token_type_ids'], tensors_holdout['attention_mask']])\n",
    "ln_q_hat_holdout = QuantityPredictionNetwork.predict([tensors_holdout['input_ids'], tensors_holdout['token_type_ids'], tensors_holdout['attention_mask']])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "ir-_yAfkPM6f",
    "outputId": "45298cff-9860-4e32-964e-4010bfc51f38"
   },
   "outputs": [],
   "source": [
    "# Compute residuals\n",
    "r_p = holdout[\"ln_p\"] - ln_p_hat_holdout.reshape((-1,))\n",
    "r_q = holdout[\"ln_q\"] - ln_q_hat_holdout.reshape((-1,))\n",
    "\n",
    "# Regress to obtain elasticity estimate\n",
    "beta = np.mean(r_p * r_q) / np.mean(r_p * r_p)\n",
    "\n",
    "# standard error on elastiticy estimate\n",
    "se = np.sqrt(np.mean( (r_p* r_q)**2)/(np.mean(r_p*r_p)**2)/holdout[\"ln_p\"].size)\n",
    "\n",
    "print('Elasticity of Demand with Respect to Price: {}'.format(beta))\n",
    "print('Standard Error: {}'.format(se))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "s2rfohtqJzIf"
   },
   "source": [
    "## Clustering Products\n",
    "\n",
    "In this final part of the notebook, we'll illustrate how the BERT text embeddings can be used to cluster products based on their  descriptions.\n",
    "\n",
    "Intiuitively, our neural network has now learned which aspects of the text description are relevant to predict prices and market shares. \n",
    "We can therefore use the embeddings produced by our network to cluster products, and we might expect that the clusters reflect market-relevant information. \n",
    "\n",
    "In the following block of cells, we compute embeddings using our learned models and cluster them using $k$-means clustering with $k=10$. Finally, we will explore how the estimated price elasticity differs across clusters.\n",
    "\n",
    "### Overview of **$k$-means clustering**\n",
    "The $k$-means clustering algorithm seeks to divide $n$ data vectors into $k$ groups, each of which contain points that are \"close together.\"\n",
    "\n",
    "In particular, let $C_1, \\ldots, C_k$ be a partitioning of the data into $k$ disjoint, nonempty subsets (clusters), and define\n",
    "$$\\bar{C_i}=\\frac{1}{\\#C_i}\\left(\\sum_{x \\in C_i} x\\right)$$\n",
    "to be the *centroid* of the cluster $C_i$. The $k$-means clustering score $\\mathrm{sc}(C_1 \\ldots C_k)$ is defined to be\n",
    "$$\\mathrm{sc}(C_1 \\ldots C_k) = \\sum_{i=1}^k \\sum_{x \\in C_i} \\left(x - \\bar{C_i}\\right)^2.$$\n",
    "\n",
    "The $k$-means clustering is then defined to be any partitioning $C^*_1 \\ldots C^*_k$ that minimizes the score $\\mathrm{sc}(-)$.\n",
    "\n",
    "**Problem 4** Show that the $k$-means clustering depends only on the pairwise distances between points. *Hint: verify that $\\sum_{x,y \\in C_i} (x - \\bar{C_i})(y - \\bar{C_i}) = 0$.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "WAm9zh0yJ1Px",
    "outputId": "10c72f7c-8581-445b-c360-5543dafde2f6"
   },
   "outputs": [],
   "source": [
    "## STEP 1: Compute embeddings\n",
    "\n",
    "input_ids = Input(shape=(128,), dtype=tf.int32)\n",
    "token_type_ids = Input(shape=(128,), dtype=tf.int32)\n",
    "attention_mask = Input(shape=(128,), dtype=tf.int32)\n",
    "\n",
    "Y1 = bert(input_ids, token_type_ids, attention_mask)[1]\n",
    "Y2 = bert2(input_ids, token_type_ids, attention_mask)[1]\n",
    "Y = Concatenate()([Y1,Y2])\n",
    "\n",
    "embedding_model = Model([input_ids, token_type_ids, attention_mask], Y)\n",
    "\n",
    "embeddings = embedding_model.predict([tensors_holdout['input_ids'], tensors_holdout['token_type_ids'], tensors_holdout['attention_mask']])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "--6o97H4OKug"
   },
   "source": [
    "### Dimension reduction and the **Johnson-Lindenstrauss transform**\n",
    "\n",
    "Our learned embeddings have dimension in the $1000$s, and $k$-means clustering is often an expensive operation. To improve the situation, we will use a neat trick that is used extensively in machine learning applications: the *Johnson-Lindenstrauss transform*. \n",
    "\n",
    "This trick involves finding a low-dimensional linear projection of the embeddings that approximately preserves pairwise distances. \n",
    "\n",
    "In fact, Johnson and Lindenstrauss proved a much more interesting statement: a Gaussian random matrix will *almost always* approximately preserve pairwise distances.\n",
    "\n",
    "**Problem 5** Suppose we have a low-dimensional projection matrix $\\Pi$ that preserves pairwise distances, and let $X$ be the design matrix. Explain how and why we could compute the $k$-means clustering using only the projected data $\\Pi X$. *Hint: use Problem 4.*\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "u3Nq-p4JNsm8"
   },
   "outputs": [],
   "source": [
    "# STEP 2 Make low-dimensional projections\n",
    "from sklearn.random_projection import GaussianRandomProjection\n",
    "\n",
    "jl = GaussianRandomProjection(eps=.25)\n",
    "embeddings_lowdim = jl.fit_transform(embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "8yP9aMA2Vf2n"
   },
   "outputs": [],
   "source": [
    "# STEP 3 Compute clusters\n",
    "from sklearn.cluster import KMeans\n",
    "\n",
    "k_means = KMeans(n_clusters=10)\n",
    "k_means.fit(embeddings_lowdim)\n",
    "cluster_ids = k_means.labels_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ygdm4sofdTBs"
   },
   "outputs": [],
   "source": [
    "# STEP 4 Regress within each cluster\n",
    "\n",
    "betas = np.zeros(10)\n",
    "ses = np.zeros(10)\n",
    "\n",
    "for c in range(10):\n",
    "\n",
    "  r_p_c = r_p[cluster_ids == c]\n",
    "  r_q_c = r_q[cluster_ids == c]\n",
    "\n",
    "  # Regress to obtain elasticity estimate\n",
    "  betas[c] = np.mean(r_p_c * r_q_c) / np.mean(r_p_c * r_p_c)\n",
    "\n",
    "  # standard error on elastiticy estimate\n",
    "  ses[c] = np.sqrt(np.mean( (r_p_c* r_q_c)**2)/(np.mean(r_p_c*r_p_c)**2)/r_p_c.size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 286
    },
    "id": "cF7Y06XnSGbl",
    "outputId": "45788128-c6fa-4fed-ffc9-0a3099a99d2e"
   },
   "outputs": [],
   "source": [
    "# STEP 5 Plot\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "plt.bar(range(10),betas, yerr = ses)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "Feature Engineering with BERT.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
